Skip to content

JAVA-5949 preserve connection pool on backpressure errors when establishing connections#1900

Open
nhachicha wants to merge 25 commits intomongodb:backpressurefrom
nhachicha:nh/backpressure/preserve_connection_pool
Open

JAVA-5949 preserve connection pool on backpressure errors when establishing connections#1900
nhachicha wants to merge 25 commits intomongodb:backpressurefrom
nhachicha:nh/backpressure/preserve_connection_pool

Conversation

@nhachicha
Copy link
Copy Markdown
Collaborator

@nhachicha nhachicha commented Feb 25, 2026

Original PR #1854 accidentally closed, and had no outstanding review comments.

we should merge the latter into main, then merge this PR in backpressure, and leave the necessary TODO comments in #1918 related to the dependency on #1856.

  • then, when there are no outstanding PRs for the backpressure branch, rebase backpressure on top of main to include the changes from the main (see Merge main into backpressure #1917 (comment) for more context on this process).
  • then address the aforementioned TODO comments.

Specification changes:

Note that there was a change in the description of one of the tests that belongs to this PR, but I closed the corresponding JAVA-6095 (see this comment): DRIVERS-3394: adjust spec test description (#1894).

JAVA-5949, JAVA-6056, JAVA-5664

AI usage and effectiveness:

  • Model Opus 4.6 then 4.7
  • Used as an implementer + paired reviewer
  • Spec audits
  • Most architectural decisions were grounded in "what do Node and .NET do?" in addition to the reasoning from the spec text.

@nhachicha nhachicha self-assigned this Feb 25, 2026
@nhachicha nhachicha requested a review from vbabanin February 25, 2026 16:56
@nhachicha nhachicha marked this pull request as ready for review February 25, 2026 16:56
@nhachicha nhachicha requested a review from a team as a code owner February 25, 2026 16:56
@codeowners-service-app
Copy link
Copy Markdown

Assigned stIncMale for team dbx-java because vbabanin is out of office.

@stIncMale
Copy link
Copy Markdown
Member

Added the following to the description of the PR (a new piece of work to be done within the PR): DRIVERS-3394: adjust spec test description (#1894)(JAVA-6095).

boolean terminated = executor.awaitTermination(20, SECONDS);
assertTrue("Executor did not terminate within timeout", terminated);

// Assert at least 10 ConnectionCheckOutFailedEvents occurred
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this comment? The assertion message already explains the intent (e.g., “Expected at least 10 ConnectionCheckOutFailedEvents, but got …”).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

assertTrue("Expected at least 10 ConnectionCheckOutFailedEvents, but got " + connectionCheckOutFailedEventCount.get(),
connectionCheckOutFailedEventCount.get() >= 10);

// Assert 0 PoolClearedEvents occurred
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +341 to +344
// Teardown: sleep 1 second and reset rate limiter
Thread.sleep(1000);
adminDatabase.runCommand(new Document("setParameter", 1)
.append("ingressConnectionEstablishmentRateLimiterEnabled", false));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cleanup is currently conditional on the code above completing successfully. If an assertion or exception happens earlier, this teardown won’t run, which can leak state into subsequent tests.

We should move it this cleanup into @AfterEach/afterEach so it runs reliably regardless of how the test exits.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's wrapped under a finally block now

Comment on lines +283 to +296
AtomicInteger connectionCheckOutFailedEventCount = new AtomicInteger(0);
AtomicInteger poolClearedEventCount = new AtomicInteger(0);

ConnectionPoolListener connectionPoolListener = new ConnectionPoolListener() {
@Override
public void connectionCheckOutFailed(final ConnectionCheckOutFailedEvent event) {
connectionCheckOutFailedEventCount.incrementAndGet();
}

@Override
public void connectionPoolCleared(final ConnectionPoolClearedEvent event) {
poolClearedEventCount.incrementAndGet();
}
};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of introducing a new anonymous listener with counters, we can reuse the existing test listener:

TestConnectionPoolListener connectionPoolListener = new TestConnectionPoolListener();

It already provides await helpers that double as assertions and helpers to assert that zero PoolClearedEvents happened, e.g.:

connectionPoolListener.waitForEvent(ConnectionPoolClearedEvent.class, 1, 110, SECONDS);

This keeps the test more concise and reuses established utilities for clarity/consistency.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// clear the pool as they're not related to overload.
// TLS configuration errors (certificate validation, protocol mismatches) should also clear the pool
// as they indicate configuration issues, not server overload.
if (beforeHandshake && !sdamIssue.relatedToAuth() && !sdamIssue.relatedToTlsConfigurationError()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we attach SystemOverloadedError and RetryableError labels in the SDAM error-handling path (effectively only for DefaultServer). In load-balanced mode, SDAM isn’t involved: the LB code path invalidates the pool directly (e.g., connectionPool.invalidate(serviceId, generation)), so the labeling logic is bypassed.

This means users running the driver in LB mode (behind an NLB) can still hit network errors, TLS handshake failures, timeouts during connection establishment or hello, but won’t get the labels.

However, these labels are a CMAP requirement, not SDAM. The CMAP spec states: “The pool MUST add the error labels SystemOverloadedError and RetryableError to network errors or network timeouts it encounters during the connection establishment or the hello message.”

Since this is defined as a pool behavior (topology-agnostic), it seems we should implement the labeling in the connection pool layer so it applies consistently in both default and load-balanced modes.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// clear the pool as they're not related to overload.
// TLS configuration errors (certificate validation, protocol mismatches) should also clear the pool
// as they indicate configuration issues, not server overload.
if (beforeHandshake && !sdamIssue.relatedToAuth() && !sdamIssue.relatedToTlsConfigurationError()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we don’t distinguish DNS lookup failures (UnknownHostException) from other connection-establishment network errors. As a result, a DNS failure goes through the generic path (same as connection reset/timeout) and gets SystemOverloadedError/RetryableError labels.

The CMAP spec excludes DNS failures from backpressure labeling: `“For errors that the driver can distinguish as never occurring due to server overload, such as DNS lookup failures […] the driver MUST NOT add backpressure error labels for these error types.”.

Proposed change: detect DNS failure by walking the exception cause chain for UnknownHostException (it’s wrapped as MongoSocketException from ServerAddressHelper.getSocketAddresses()), and when present, skipping backpressure label attachment so SDAM follows the normal path (clear the pool and mark the server Unknown).

In that case, we should add coverage to assert that labeling and pool clearing behaviour. If the driver ever changes the wrapper exception type MongoSocketException (or stops wrapping UnknownHostException this way) and starts adding labels the test should fail.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates SDAM/backpressure handling so that connection-establishment failures caused by server-side backpressure do not clear the connection pool, and adds/updates tests and matchers to align with the SDAM spec test expectations.

Changes:

  • Adjust SDAM exception handling to avoid invalidating the pool on certain pre-handshake network failures, and add TLS-configuration-error classification.
  • Add new error label constants (SystemOverloadedError, RetryableError) and apply them for backpressure-related failures.
  • Extend unified event matching for additional server types and add a prose test for connection pool backpressure behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
driver-sync/src/test/functional/com/mongodb/client/unified/EventMatcher.java Extends server type matching for SDAM serverDescriptionChangedEvent expectations.
driver-sync/src/test/functional/com/mongodb/client/ServerDiscoveryAndMonitoringProseTests.java Adds a prose test intended to validate connection pool backpressure behavior via server rate limiter parameters.
driver-core/src/test/unit/com/mongodb/internal/connection/DefaultServerSpecification.groovy Updates unit tests to expect pool preservation on network errors during connection establishment (sync/async).
driver-core/src/main/com/mongodb/internal/connection/SdamServerDescriptionManager.java Adds TLS configuration error detection helper on SDAM issues.
driver-core/src/main/com/mongodb/internal/connection/DefaultSdamServerDescriptionManager.java Changes SDAM exception handling to preserve pool/server description for certain pre-handshake failures and apply labels.
driver-core/src/main/com/mongodb/MongoException.java Introduces new public error label constants used by the backpressure handling path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +53 to +68
/**
* An error label indicating that the server is overloaded.
*
* @see #hasErrorLabel(String)
* @since 5.7
*/
public static final String SYSTEM_OVERLOADED_ERROR_LABEL = "SystemOverloadedError";

/**
* An error label indicating that the operation is safely retryable.
*
* @see #hasErrorLabel(String)
* @since 5.7
*/
public static final String RETRYABLE_ERROR_LABEL = "RetryableError";

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced these constants in #1918. There is more to this than the change do in the current PR. Let's revert the change in the current PR to resolve the conflict with backpressure.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +141 to +145
// Backpressure spec: Don't clear pool or mark server unknown for connection establishment failures
// (network errors or timeouts during handshake). Authentication errors after handshake should still
// clear the pool as they're not related to overload.
// TLS configuration errors (certificate validation, protocol mismatches) should also clear the pool
// as they indicate configuration issues, not server overload.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't usually copy the specification prose into code as comments. I suspect, this comment was left by the AI.

Let's remove this comment, unless there is a good reason to leave it, which was not applicable to the rest of the code in this file that existed before the current PR.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +147 to +149
// Don't update server description to Unknown
// Don't invalidate the connection pool
// Apply error labels for backpressure
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't usually copy the specification prose into code as comments. I suspect, this comment was left by the AI.

Let's remove this comment, unless there is a good reason to leave it, which was not applicable to the rest of the code in this file that existed before the current PR.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Member

@stIncMale stIncMale Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[this comment is left on a file, but has nothing to do with the file; this is done so that we could reply to the comment; commenting on a PR does not allow replies - horrendous GitHub functionality]

The current PR seemingly depends on #1856 (see the description of the current PR). We need to decide what to do with that.

P.S. I originally left my thoughts in the description of this PR, but I don't know if that's what we are going to do.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer depend on this PR see #1900 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[this comment is left on a file, but has nothing to do with the file; this is done so that we could reply to the comment; commenting on a PR does not allow replies - horrendous GitHub functionality]

@nhachicha

  • Could you please confirm that all the specification changes listed in the "Specification changes" part of the description of the current PR have been addressed in the PR?
  • If they have been addressed, let's update the description correspondingly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description updated. All specs are implemented

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[this comment is left on a file, but has nothing to do with the file; this is done so that we could reply to the comment; commenting on a PR does not allow replies - horrendous GitHub functionality]

JAVA-5949 has an old comment, which instructs to re-enable some tests that were previously disabled when we updated the testing/resources/specifications submodule in main. Those tests were not enabled. Let's enable them and make sure they pass.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR no longer depends on #1856. Spec PR mongodb/specifications#1880 rewrote the backpressure-network-*-fail.yml tests to wait on serverDescriptionChangedEvent instead of serverHeartbeatSucceededEvent.

@nhachicha nhachicha force-pushed the nh/backpressure/preserve_connection_pool branch from 0be0250 to bce97d4 Compare April 22, 2026 17:16
…veryAndMonitoringProseTests.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread driver-sync/src/test/functional/com/mongodb/client/unified/EventMatcher.java Outdated
stIncMale and others added 5 commits April 27, 2026 15:58
- Deprioritize sharded clusters on any error, all other topologies only on SystemOverloadedError.
- Pass ClusterType to updateCandidate so onAttemptFailure can distinguish topology types.
- Add retryable reads prose tests 3.1 and 3.2.
- Change ServerSelectionSelectionTest to use BaseCluster server selection chain.

JAVA-6105
JAVA-6021
JAVA-6074
---------
Co-authored-by: Valentin Kovalenko <valentin.male.kovalenko@gmail.com>
Co-authored-by: Ross Lawley <ross.lawley@gmail.com>
- Add enableOverloadRetargeting boolean option to MongoClientSettings and ConnectionString to allow
  the driver to route requests to a different replica set member on retries when the previously
  used server is overloaded
- Add prose test 3.3 to verify that overload errors are retried on the same server when retargeting
  is disabled

JAVA-6167
---------

Co-authored-by: Ross Lawley <ross.lawley@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ntMatcher.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +285 to +293
MongoClientSettings clientSettings = getMongoClientSettingsBuilder()
.applyToConnectionPoolSettings(builder -> builder
.maxConnecting(100)
.addConnectionPoolListener(connectionPoolListener))
.build();

try (MongoClient adminClient = MongoClients.create(getMongoClientSettingsBuilder().build());
MongoClient client = MongoClients.create(clientSettings)) {

Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testConnectionPoolBackpressure uses getMongoClientSettingsBuilder() (no direct connection) and then runs setParameter against admin. In sharded/load-balanced test environments this client may be connected to a mongos/LB endpoint, where these ingressConnectionEstablishment* parameters are not expected to be configurable, causing the test to fail or not exercise the intended server behavior. Consider either skipping the test when ClusterFixture.isSharded()/ClusterFixture.isLoadBalanced() or forcing a direct connection to a mongod (e.g., via applyToClusterSettings(ClusterFixture::setDirectConnection) / connecting to the primary address explicitly).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rejected: deviate from the spec

Comment on lines +73 to +100
private static boolean isTlsConfigurationError(final Throwable t) {
Throwable cause = t.getCause();
while (cause != null) {
if (cause instanceof CertificateException
|| cause instanceof CertPathBuilderException
|| cause instanceof CertPathValidatorException
|| cause instanceof SSLPeerUnverifiedException
|| cause instanceof SSLProtocolException) {
return true;
}
if (cause instanceof SSLHandshakeException) {
String message = cause.getMessage();
if (message != null) {
String lowerMessage = message.toLowerCase(Locale.ROOT);
if (lowerMessage.contains("certificate")
|| lowerMessage.contains("verify")
|| lowerMessage.contains("trust")
|| lowerMessage.contains("hostname")
|| lowerMessage.contains("protocol")
|| lowerMessage.contains("cipher")
|| lowerMessage.contains("handshake_failure")) {
return true;
}
}
}
cause = cause.getCause();
}
return false;
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TLS-configuration-error exclusion logic in isTlsConfigurationError is new, but there is no unit test asserting that common TLS misconfiguration failures (e.g., certificate validation / handshake failures) are not labeled as SystemOverloadedError and therefore still lead to normal pool invalidation. Adding a focused unit test similar to the DNS lookup failure test would help prevent regressions in this classification logic.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +803 to 805
case "connectionClosedEvent":
case "connectionCheckedInEvent":
context.getEventMatcher().waitForConnectionPoolEvents(clientId, event, count, entities.getConnectionPoolListener(clientId));
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executeWaitForEvent now supports waiting for connectionClosedEvent / connectionCheckedInEvent, but executeAssertEventCount (and EventMatcher.assertConnectionPoolEventCount) still only supports poolClearedEvent / poolReadyEvent. If any unified tests use assertEventCount for these new connection events, they will throw UnsupportedOperationException. Please extend executeAssertEventCount (and the matcher) to support counting these event types as well (or confirm the unified format only uses waitForEvent for them and document that constraint).

Copilot uses AI. Check for mistakes.
Comment on lines +212 to +217
case "connectionClosedEvent":
eventClass = ConnectionClosedEvent.class;
break;
case "connectionCheckedInEvent":
eventClass = ConnectionCheckedInEvent.class;
break;
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waitForConnectionPoolEvents was extended to support connectionClosedEvent / connectionCheckedInEvent, but assertConnectionPoolEventCount still only supports poolClearedEvent / poolReadyEvent. This asymmetry can break unified tests that use assertEventCount for these new connection pool events. Please add these event types to assertConnectionPoolEventCount (and mirror the support in UnifiedTest.executeAssertEventCount).

Copilot uses AI. Check for mistakes.
executor.awaitTermination(20, SECONDS));
} finally {
if (!executor.isTerminated()) {
executor.shutdownNow();
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After calling shutdownNow() when the executor doesn’t terminate in time, the test doesn’t do a second awaitTermination. If driver I/O doesn’t respond promptly to interrupts, threads may linger and destabilize/hang subsequent tests. Consider awaiting termination after shutdownNow() (with a short timeout) and failing explicitly if it still won’t terminate, to ensure the suite can reliably progress.

Suggested change
executor.shutdownNow();
executor.shutdownNow();
try {
if (!executor.awaitTermination(5, SECONDS)) {
fail("Executor did not terminate after shutdownNow");
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
fail("Interrupted while awaiting executor termination after shutdownNow");
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants